We present efficient realization of Householder Transform (HT) based QRfactorization through algorithm-architecture co-design where we achieveperformance improvement of 3-90x in-terms of Gflops/watt over state-of-the-artmulticore, General Purpose Graphics Processing Units (GPGPUs), FieldProgrammable Gate Arrays (FPGAs), and ClearSpeed CSX700. Theoretical andexperimental analysis of classical HT is performed for opportunities to exhibithigher degree of parallelism where parallelism is quantified as a number ofparallel operations per level in the Directed Acyclic Graph (DAG) of thetransform. Based on theoretical analysis of classical HT, an opportunityre-arrange computations in the classical HT is identified that results inModified HT (MHT) where it is shown that MHT exhibits 1.33x times higherparallelism than classical HT. Experiments in off-the-shelf multicore andGeneral Purpose Graphics Processing Units (GPGPUs) for HT and MHT suggest thatMHT is capable of achieving slightly better or equal performance compared toclassical HT based QR factorization realizations in the optimized softwarepackages for Dense Linear Algebra (DLA). We implement MHT on a customizedplatform for Dense Linear Algebra (DLA) and show that MHT achieves 1.3x betterperformance than native implementation of classical HT on the same accelerator.For custom realization of HT and MHT based QR factorization, we also identifymacro operations in the DAGs of HT and MHT that are realized on aReconfigurable Data-path (RDP). We also observe that due to re-arrangement inthe computations in MHT, custom realization of MHT is capable of achieving 12%better performance improvement over multicore and GPGPUs than the performanceimprovement reported by General Matrix Multiplication (GEMM) over highly tunedDLA software packages for multicore and GPGPUs which is counter-intuitive.
展开▼
机译:我们提出了通过算法-架构协同设计有效实现基于Householder Transform(HT)的QR分解的方法,与最先进的多核通用图形处理单元(GPGPU)相比,我们实现了3-90倍Gflops / watt的性能提升),现场可编程门阵列(FPGA)和ClearSpeed CSX700。进行经典HT的理论和实验分析是为了表现出更高的并行度,其中并行度被量化为变换的有向无环图(DAG)中每级的并行操作数量。基于对经典HT的理论分析,确定了经典HT中的机会重排计算,从而导致修改后的HT(MHT),其中表明MHT的并行度比经典HT高1.33倍。在针对HT和MHT的现成的多核和通用图形处理单元(GPGPU)中进行的实验表明,与Dense Linear Algebra(DLA)的优化软件包中基于经典HT的QR因式分解实现相比,MHT能够实现稍好或相等的性能。我们在密集线性代数(DLA)的定制平台上实现了MHT,并证明了MHT的性能比在同一加速器上经典HT的本地实现要好1.3倍。对于HT和基于MHT的QR因式分解的自定义实现,我们还确定了DAG中的宏操作HT和MHT在可重配置数据路径(RDP)上实现。我们还观察到,由于MHT中计算的重新安排,MHT的自定义实现比通用矩阵乘法(GEMM)所报告的性能经过高度调整的DLA软件包对多核和GPGPU的性能提高了12%。 GPGPU违反直觉。
展开▼